Basic modeling in scikit-learn

python
datacamp
machine learning
models
Author

kakamana

Published

March 22, 2023

Validating models requires an understanding of how they are created and used. In this brief summary, you will learn how to run regression and classification models in Scikit-Learn. This model building foundation will be utilized throughout the remainder of this chapter.

This Basic modeling in scikit-learn article is part of the DataCamp course Model Validation in Python. It has never been easier to implement machine learning models than it is today, but without proper validation, the results of running new data through a model may not be as accurate as expected. Validating models allows analysts to answer confidently the question, “How good is your model?”. The course addresses this question for classification models using the complete set of tic-tac-toe endgame scenarios, and for regression models using fivethirtyeight’s ultimate Halloween candy power ranking dataset. The purpose of the course is to introduce the basics of model validation, to discuss various validation techniques, and to begin developing tools for creating high-performance, validated models.

This article documents my experience learning data science through DataCamp. These repository contributions are part of my learning journey through my graduate program, the Master of Applied Data Science (MADS) at the University of Michigan, as well as DeepLearning.AI, Coursera & DataCamp. You can find similar articles & more stories on my Medium & LinkedIn profiles. I am also available on Kaggle, GitHub blogs & GitHub repos. Thank you for your motivation, support & valuable feedback.

These include projects, coursework & notebooks that I worked through during my data science journey. They are provided for reproducibility & future reference only. All source code, slides, and screenshots are the intellectual property of their respective content authors. If you find this content beneficial, kindly consider a learning subscription from DeepLearning.AI, Coursera, or DataCamp.

Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error as mae
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier

Modeling View

Introduction to model validation

  • Model validation
    • Ensuring your model performs as expected on new data
    • Testing model performance on holdout datasets
    • Selecting the best model, parameters, and accuracy metrics
    • Achieving the best accuracy for the given data

Seen vs. unseen data

For example, predicting the popularity of Skittles will likely be more accurate than predicting the popularity of Andes Mints, since Skittles is included in the dataset, whereas Andes Mints is not.

You have built a model on about 50 candies (X_train) and need to show how accurately it predicts the popularity of both the candies it was trained on and the roughly 35 candies (X_test) it has never encountered before. As the accuracy metric, you will use the mean absolute error, mae().
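As a quick illustration of the metric, mean absolute error is the average of the absolute differences between actual and predicted values. The sketch below uses made-up win percentages (not values from the candy dataset) and checks a manual calculation against scikit-learn's mean_absolute_error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted win percentages (illustrative only)
y_true = np.array([60.0, 45.0, 70.0, 30.0])
y_pred = np.array([55.0, 50.0, 65.0, 40.0])

# Manual MAE: average of |actual - predicted|
manual_mae = np.mean(np.abs(y_true - y_pred))

# Same calculation using scikit-learn
sklearn_mae = mean_absolute_error(y_true=y_true, y_pred=y_pred)

print(manual_mae, sklearn_mae)
```

A lower value means the predictions are, on average, closer to the actual win percentages.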

Code
candy = pd.read_csv('dataset/candy-data.csv')
candy.head()
competitorname chocolate fruity caramel peanutyalmondy nougat crispedricewafer hard bar pluribus sugarpercent pricepercent winpercent
0 100 Grand 1 0 1 0 0 1 0 1 0 0.732 0.860 66.971725
1 3 Musketeers 1 0 0 0 1 0 0 1 0 0.604 0.511 67.602936
2 One dime 0 0 0 0 0 0 0 0 0 0.011 0.116 32.261086
3 One quarter 0 0 0 0 0 0 0 0 0 0.011 0.511 46.116505
4 Air Heads 0 1 0 0 0 0 0 0 0 0.906 0.511 52.341465
Code
X = candy.drop(['competitorname', 'winpercent'], axis=1)
y = candy['winpercent']
Code
X
chocolate fruity caramel peanutyalmondy nougat crispedricewafer hard bar pluribus sugarpercent pricepercent
0 1 0 1 0 0 1 0 1 0 0.732 0.860
1 1 0 0 0 1 0 0 1 0 0.604 0.511
2 0 0 0 0 0 0 0 0 0 0.011 0.116
3 0 0 0 0 0 0 0 0 0 0.011 0.511
4 0 1 0 0 0 0 0 0 0 0.906 0.511
... ... ... ... ... ... ... ... ... ... ... ...
80 0 1 0 0 0 0 0 0 0 0.220 0.116
81 0 1 0 0 0 0 1 0 0 0.093 0.116
82 0 1 0 0 0 0 0 0 1 0.313 0.313
83 0 0 1 0 0 0 1 0 0 0.186 0.267
84 1 0 0 0 0 1 0 0 1 0.872 0.848

85 rows × 11 columns

Code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

model = RandomForestRegressor(n_estimators=50)
Code
# The model is fit using X_train and y_train
model.fit(X_train, y_train)

# Create vectors of predictions
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)

# Train/Test Errors
train_error = mae(y_true=y_train, y_pred=train_predictions)
test_error = mae(y_true=y_test, y_pred=test_predictions)

# Print the error (MAE) for seen and unseen data
print("Model error on seen data: {0:.2f}.".format(train_error))
print("Model error on unseen data: {0:.2f}.".format(test_error))
print("\nExcellent. When models perform differently on training and testing data, you should look to model validation to ensure you have the best performing model. In the next lesson, you will start building models to validate.")
Model error on seen data: 4.22.
Model error on unseen data: 9.59.

Excellent. When models perform differently on training and testing data, you should look to model validation to ensure you have the best performing model. In the next lesson, you will start building models to validate.

Regression Models

Random forest parameters
    n_estimators: the number of trees in the forest
    max_depth: the maximum depth of the trees
    random_state: random seed

Set parameters and fit a model

Predictive tasks fall into one of two categories: regression or classification. In the candy dataset, the outcome is a continuous variable describing how often the candy was chosen over another candy in a series of 1-on-1 match-ups. To predict this value (the win-percentage), you will use a regression model.

Code
rfr = RandomForestRegressor()
Code
# Set the number of trees (the parameter name is n_estimators, plural)
rfr.n_estimators = 100

# Add a maximum depth
rfr.max_depth = 6

# Set the random state
rfr.random_state = 1111

# Fit the model
rfr.fit(X_train, y_train)
RandomForestRegressor(max_depth=6, random_state=1111)
Code
print("\fWell done! You have updated parameters _after_ the model was initialized. This approach is helpful when you need to update parameters. Before making predictions, let's see which candy characteristics were most important to the model.")
Well done! You have updated parameters _after_ the model was initialized. This approach is helpful when you need to update parameters. Before making predictions, let's see which candy characteristics were most important to the model.
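Plain attribute assignment does not check parameter names, so a misspelled name is silently stored as a new attribute and ignored by the model. As a hedged alternative sketch, scikit-learn estimators also provide set_params(), which rejects unknown names. The variable names rfr_demo and typo_rejected are illustrative and not part of the original exercise.

```python
from sklearn.ensemble import RandomForestRegressor

rfr_demo = RandomForestRegressor()

# set_params() updates several parameters at once and validates their names
rfr_demo.set_params(n_estimators=100, max_depth=6, random_state=1111)
print(rfr_demo.get_params()["max_depth"])

# A misspelled parameter name raises a ValueError instead of being ignored
try:
    rfr_demo.set_params(n_estimator=100)
    typo_rejected = False
except ValueError:
    typo_rejected = True

print(typo_rejected)
```

Either approach works before calling fit(); set_params() simply fails earlier when a name is wrong.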

Feature importances

Despite the fact that some candy attributes, such as chocolate, may be extremely popular, they are not necessarily important to model predictions. Upon fitting a random forest model, it is possible to review the model’s attribute, .feature_importances_, to determine which variables had the greatest impact. By looping over the feature importance array using enumerate(), you can determine how important each variable was to the model.

Python’s enumerate() function can loop over a list and create an automatic counter at the same time.
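A minimal sketch of how enumerate() behaves, using a short illustrative list of feature names and importance values written out by hand (nothing is computed from a model here). It also shows zip() as an alternative way to pair names with values.

```python
# Illustrative feature names and importance values (not computed here)
feature_names = ["chocolate", "sugarpercent", "pricepercent"]
importances = [0.27, 0.22, 0.30]

# enumerate() yields (counter, value) pairs, so the counter can index a second list
lines = []
for i, item in enumerate(importances):
    lines.append("{0:s}: {1:.2f}".format(feature_names[i], item))

# zip() pairs the two lists directly and produces the same result
zipped = ["{0:s}: {1:.2f}".format(name, item)
          for name, item in zip(feature_names, importances)]

print(lines)
```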

Code
# Print how important each column is to the model
for i, item in enumerate(rfr.feature_importances_):
    # Use i and item to print out the feature importance of each column
    print("{0:s}: {1:.2f}".format(X_train.columns[i], item))
print("\nWell done. In this run, pricepercent, chocolate, and sugarpercent are the most important variables. .feature_importances_ is a great way to see which variables were important to your random forest model.")
chocolate: 0.27
fruity: 0.04
caramel: 0.03
peanutyalmondy: 0.05
nougat: 0.01
crispedricewafer: 0.01
hard: 0.01
bar: 0.04
pluribus: 0.02
sugarpercent: 0.22
pricepercent: 0.30

Well done. In this run, pricepercent, chocolate, and sugarpercent are the most important variables. .feature_importances_ is a great way to see which variables were important to your random forest model.

Classification models

Classification predictions

In model validation, it is often important to know more about the predictions than just the final classification. When predicting who will win a game, most people are also interested in how likely it is a team will win.

Code
tic_tac_toe = pd.read_csv('dataset/tic-tac-toe.csv')
tic_tac_toe.head()
Top-Left Top-Middle Top-Right Middle-Left Middle-Middle Middle-Right Bottom-Left Bottom-Middle Bottom-Right Class
0 x x x x o o x o o positive
1 x x x x o o o x o positive
2 x x x x o o o o x positive
3 x x x x o o o b b positive
4 x x x x o o b o b positive
Code
y = tic_tac_toe['Class'].apply(lambda x: 1 if x == 'positive' else 0)
X = tic_tac_toe.drop('Class', axis=1)
X = pd.get_dummies(X)
Code
X
Top-Left_b Top-Left_o Top-Left_x Top-Middle_b Top-Middle_o Top-Middle_x Top-Right_b Top-Right_o Top-Right_x Middle-Left_b ... Middle-Right_x Bottom-Left_b Bottom-Left_o Bottom-Left_x Bottom-Middle_b Bottom-Middle_o Bottom-Middle_x Bottom-Right_b Bottom-Right_o Bottom-Right_x
0 0 0 1 0 0 1 0 0 1 0 ... 0 0 0 1 0 1 0 0 1 0
1 0 0 1 0 0 1 0 0 1 0 ... 0 0 1 0 0 0 1 0 1 0
2 0 0 1 0 0 1 0 0 1 0 ... 0 0 1 0 0 1 0 0 0 1
3 0 0 1 0 0 1 0 0 1 0 ... 0 0 1 0 1 0 0 1 0 0
4 0 0 1 0 0 1 0 0 1 0 ... 0 1 0 0 0 1 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
953 0 1 0 0 0 1 0 0 1 0 ... 0 0 1 0 0 0 1 0 0 1
954 0 1 0 0 0 1 0 1 0 0 ... 0 0 0 1 0 1 0 0 0 1
955 0 1 0 0 0 1 0 1 0 0 ... 1 0 0 1 0 1 0 0 0 1
956 0 1 0 0 0 1 0 1 0 0 ... 1 0 0 1 0 1 0 0 0 1
957 0 1 0 0 1 0 0 0 1 0 ... 0 0 1 0 0 0 1 0 0 1

958 rows × 27 columns

Code
y
0      1
1      1
2      1
3      1
4      1
      ..
953    0
954    0
955    0
956    0
957    0
Name: Class, Length: 958, dtype: int64
Code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)
rfc = RandomForestClassifier()
Code
# Fit the rfc model.
rfc.fit(X_train, y_train)

# Create arrays of predictions
classification_predictions = rfc.predict(X_test)
probability_predictions = rfc.predict_proba(X_test)

# Print out count of binary predictions
print(pd.Series(classification_predictions).value_counts())

# Print the first value from probability_predictions
print('The first predicted probabilities are: {}'.format(probability_predictions[0]))
1    526
0    241
dtype: int64
The first predicted probabilities are: [0.6 0.4]
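To make the relationship between the two methods concrete, here is a hedged sketch on synthetic data (make_classification stands in for the tic-tac-toe features, and the names clf, X_demo, and y_demo are illustrative). It shows that predict_proba() returns one probability per class for each row, and that a random forest's predict() returns the class with the highest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data standing in for the tic-tac-toe features (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

clf = RandomForestClassifier(n_estimators=25, random_state=0)
clf.fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)   # shape (n_samples, n_classes)
labels = clf.predict(X_demo)        # shape (n_samples,)

# Each row holds one probability per class, in the order given by clf.classes_
print(clf.classes_, proba.shape)

# For a random forest, the predicted label is the class with the highest probability
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(labels, labels_from_proba))
```

The column order of the probabilities follows clf.classes_, so with labels 0 and 1 the second column is the estimated probability of class 1.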
Code
print("\n Well done! You can see there were 526 observations where Player One was predicted to win the Tic-Tac-Toe game. Also, note that the probability_predictions array contains rows with only two values because you only have two possible responses (win or lose). Remember these two methods, as you will use them a lot throughout this course.")

 Well done! You can see there were 526 observations where Player One was predicted to win the Tic-Tac-Toe game. Also, note that the probability_predictions array contains rows with only two values because you only have two possible responses (win or lose). Remember these two methods, as you will use them a lot throughout this course.

Reusing model parameters

Replicating model performance is essential for model validation. It is also important to replicate model performance when sharing models with colleagues, reusing models on new data, or asking questions on websites such as Stack Overflow. It is possible to ask other coders about model errors, output, or performance using such a site. By reusing model parameters, you can replicate your work.

Code
rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Print the classification model
print(rfc)

# Print the classification model's random state parameter
print('The random state is: {}'.format(rfc.random_state))

# Print all parameters
print('Printing the parameters dictionary: {}'.format(rfc.get_params()))
print("\nRecalling which parameters were used will be helpful going forward. Model validation and performance rely heavily on which parameters were used, and there is no way to replicate a model without keeping track of the parameters used!")
RandomForestClassifier(max_depth=6, n_estimators=50, random_state=1111)
The random state is: 1111
Printing the parameters dictionary: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 6, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 50, 'n_jobs': None, 'oob_score': False, 'random_state': 1111, 'verbose': 0, 'warm_start': False}

Recalling which parameters were used will be helpful going forward. Model validation and performance rely heavily on which parameters were used, and there is no way to replicate a model without keeping track of the parameters used!
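One way to put stored parameters to use is to rebuild a model from the dictionary returned by get_params(). The sketch below assumes synthetic data from make_classification and illustrative variable names. It suggests that, given the same data, parameters, and random_state, a rebuilt random forest reproduces the original predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

original = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)
original.fit(X_demo, y_demo)

# Store the parameters, then rebuild an identical, unfitted model from them
saved_params = original.get_params()
replica = RandomForestClassifier(**saved_params)
replica.fit(X_demo, y_demo)

# With the same data, parameters, and random_state, the predictions match
same = np.array_equal(original.predict_proba(X_demo), replica.predict_proba(X_demo))
print(same)
```

If random_state were left as None, the two fits would generally differ, which is why recording it matters for replication.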

Random forest classifier

This exercise reviews the four modeling steps discussed throughout this chapter using a random forest classification model. You will:

  • Create a random forest classification model.
  • Fit the model using the tic_tac_toe dataset.
  • Make predictions on whether Player One will win (1) or lose (0) the current game.
  • Evaluate the overall accuracy of the model.
Code
# Create a random forest classifier
rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Fit rfc using X_train and y_train
rfc.fit(X_train, y_train)

# Create predictions on X_test
predictions = rfc.predict(X_test)
print(predictions[0:5])

# Print model accuracy using score() and the testing data
print(rfc.score(X_test, y_test))
[0 1 0 0 1]
0.877444589308996
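For a classifier, score() reports mean accuracy on the data it is given. As a rough cross-check, the sketch below (on synthetic data, with illustrative variable names) compares score() against accuracy_score() applied to the model's own predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data and a fixed split, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)
clf.fit(X_tr, y_tr)

# score() on a classifier is the fraction of correct predictions
score_value = clf.score(X_te, y_te)
manual_value = accuracy_score(y_te, clf.predict(X_te))
print(score_value == manual_value)
```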
Code
print(" Notice that some of the first five predictions are 1, indicating that Player One is predicted to win those games. You can also see the model accuracy was about 88%.")
 Notice that some of the first five predictions are 1, indicating that Player One is predicted to win those games. You can also see the model accuracy was about 88%.